Update Kueue and Jobset controller default limit value #502

ycchenzheng
Collaborator

@ycchenzheng ycchenzheng commented Jun 16, 2025

Fixes / Features

  • Increase the memory limit of kueue-controller-manager and jobset-controller-manager to 1.2 MiB per VM or 4Gi, whichever is greater. The change originated from b/421199006 and b/418017963 and is implemented in utils.
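
The sizing rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function and constant names are hypothetical, "1.2Mib" is read as 1.2 MiB, 4Gi is taken as 4096 MiB, and rounding up to a whole MiB is an assumption.

```python
# Hypothetical sketch of the sizing rule described above; names are illustrative
# and do not come from the PR's implementation.

TENTHS_OF_MIB_PER_VM = 12   # 1.2 MiB per VM, kept as an integer to avoid float rounding
MIN_LIMIT_MIB = 4 * 1024    # 4Gi floor, expressed in MiB


def controller_memory_limit_mib(num_vms: int) -> int:
    """Return the larger of 1.2 MiB per VM (rounded up) and the 4Gi floor."""
    if num_vms < 0:
        raise ValueError("num_vms must be non-negative")
    # Integer ceiling of (1.2 * num_vms): add 9 before dividing by 10.
    scaled_mib = (TENTHS_OF_MIB_PER_VM * num_vms + 9) // 10
    return max(scaled_mib, MIN_LIMIT_MIB)


def as_k8s_quantity(limit_mib: int) -> str:
    """Format a MiB count as a Kubernetes resource quantity string."""
    return f"{limit_mib}Mi"


if __name__ == "__main__":
    for vms in (1000, 5000, 10000):
        print(vms, as_k8s_quantity(controller_memory_limit_mib(vms)))
```

Under these assumptions the 4Gi floor applies up to 3413 VMs; beyond that the per-VM term sets the limit.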

Testing / Documentation

Testing details.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@ycchenzheng
Collaborator Author

@SujeethJinesh

Collaborator

@RoshaniN RoshaniN left a comment

Thanks Zheng!

I had a couple of questions.

@ycchenzheng ycchenzheng requested a review from RoshaniN June 17, 2025 01:32
Collaborator

@wstcliyu wstcliyu left a comment

LGTM. Please also reach out to XPK owners for verification. @Obliviour @pawloch00

Collaborator

@Obliviour Obliviour left a comment

Thanks for this change, looking forward to avoiding limits as we scale. Can we confirm with the GKE team (if not already done) that these calculated values are good?

@ycchenzheng
Collaborator Author

@SujeethJinesh can you please take a look at this comment?

@DannyLiCom
Collaborator

@Obliviour Can you contact the GKE team? We also want to create a tracking Buganizer ticket for this issue/question and assign it to them.

@SujeethJinesh
Collaborator

@Obliviour These values are taken from our previous scale testing results. The calculation is based directly on the number of VMs and is a small overestimate of what's needed, so it should be enough.

@ycchenzheng
Collaborator Author

@pawloch00 This PR also has integration tests failing. Can you please add me to collaborator like #506 (comment) ?

@pawloch00
Collaborator

@pawloch00 This PR also has integration tests failing. Can you please add me to collaborator like #506 (comment) ?

done

@ycchenzheng ycchenzheng merged commit 8890e1a into AI-Hypercomputer:develop Jul 10, 2025
42 of 46 checks passed
7 participants